Grid Computing and Checkpoint Approach

Author

  • Pankaj Gupta
Abstract

Grid computing is a means of applying the computational power of a large number of computers to a complex, difficult computation or problem. It is a distributed computing paradigm that differs from traditional distributed computing in that it targets large-scale systems that may span organizational boundaries. In this paper we investigate the fault tolerance techniques used in many real-time distributed systems. The main focus is on the types of faults that occur in such systems, the fault detection techniques, and the recovery techniques used. A fault, whether caused by link failure, resource failure or any other reason, must be tolerated for the system to work smoothly and accurately. Such faults can be detected and recovered from by a variety of techniques. An appropriate fault detector can avoid losses due to a system crash, and a reliable fault tolerance technique can prevent system failure. This paper describes how these methods are applied to detect and tolerate faults in various real-time distributed systems. The advantages of checkpointing are clear; however, the Grid community has not yet developed a widely accepted standard that would allow the Grid environment to make deliberate use of low-level checkpointing packages. Therefore, such a standard, named the Grid Checkpointing Architecture, is being designed. The fault tolerance mechanism used here sets job checkpoints based on the resource failure rate. If a resource fails, the job is restarted from its last successful state using a checkpoint file from another grid resource. A critical requirement for automatic recovery is the availability of checkpoint files. One strategy to increase the availability of checkpoints is replication. The grid is a form of distributed computing that mainly virtualizes and utilizes geographically distributed idle resources.
A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in the loss or delay of executing jobs. To ensure good performance, fault tolerance should be taken into account. Here we address fault tolerance in terms of resource failure. A commonly used technique to achieve fault tolerance is periodic checkpointing, which periodically saves the job's state. However, an inappropriate checkpointing interval delays job execution and reduces throughput. Hence, in the proposed work, fault tolerance is achieved by dynamically adapting the checkpoints based on the current status and failure history of the resource, which is maintained in the Information server. The Last failure time and Mean failure time based algorithm dynamically modifies the checkpoint interval, and hence increases throughput by reducing unnecessary checkpoint overhead. In case of resource failure, the proposed Fault Index Based Rescheduling (FIBR) algorithm reschedules the job from the failed resource to another available resource with the lowest Fault-index value and resumes the job from the last saved checkpoint. This ensures that the job is executed within its deadline with increased throughput, and helps make the grid environment trustworthy.
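
The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the `Resource` fields, the linear rule in `checkpoint_interval` (checkpoint more often as the time since the last failure approaches the mean failure time), and the tie-free selection in `reschedule` are hypothetical and are not the paper's algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    """Hypothetical record kept by an information server for one grid resource."""
    name: str
    fault_index: int           # assumed: higher means the resource has failed more often
    mean_failure_time: float   # assumed: mean time between failures, in seconds
    last_failure_time: float   # assumed: timestamp of the most recent failure, in seconds
    available: bool = True


def checkpoint_interval(resource: Resource, now: float,
                        base_interval: float, min_interval: float) -> float:
    """Illustrative adaptive rule (an assumption, not the paper's formula).

    Right after a failure the interval equals base_interval; it shrinks
    linearly to min_interval as the elapsed time since the last failure
    approaches the resource's mean failure time.
    """
    if resource.mean_failure_time <= 0:
        return min_interval
    elapsed = max(0.0, now - resource.last_failure_time)
    risk = min(1.0, elapsed / resource.mean_failure_time)
    return base_interval - (base_interval - min_interval) * risk


def reschedule(resources: List[Resource], failed: Resource) -> Optional[Resource]:
    """Pick the available resource with the lowest fault index, excluding the failed one.

    Returns None when no other resource is available. The job would then
    resume on the chosen resource from its last saved checkpoint.
    """
    candidates = [r for r in resources if r.available and r is not failed]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.fault_index)


if __name__ == "__main__":
    a = Resource("a", fault_index=3, mean_failure_time=1000.0, last_failure_time=0.0)
    b = Resource("b", fault_index=1, mean_failure_time=5000.0, last_failure_time=0.0)
    c = Resource("c", fault_index=0, mean_failure_time=9000.0, last_failure_time=0.0,
                 available=False)
    print(checkpoint_interval(a, now=500.0, base_interval=600.0, min_interval=60.0))
    target = reschedule([a, b, c], failed=a)
    print(target.name if target else "no resource available")
```

In this sketch a resource that has run for a long time relative to its mean failure time is checkpointed more frequently, while a recently recovered resource is checkpointed less often, which is one plausible way to reduce unnecessary checkpoint overhead.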

Similar resources

An Enhanced MSS-based checkpointing Scheme for Mobile Computing Environment

Mobile computing systems are made up of different components, among which Mobile Support Stations (MSSs) play a key role. This paper proposes an efficient MSS-based non-blocking coordinated checkpointing scheme for mobile computing environments. In the suggested scheme, nearly all aspects of checkpointing and their related overheads are forwarded to the MSSs and as a result the workload of Mobile ...

Improving Mobile Grid Performance Using Fuzzy Job Replica Count Determiner

Grid computing is a term referring to the combination of computer resources from multiple administrative domains to reach a common computational platform. Mobile computing is a generic term for the use of portable, handheld devices with wireless communication to process data. Mobile computing focuses on providing access to data, information, services and communications anywhere an...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

Design and Analysis of Peer-to-Peer Fault-Tolerance Approach in a Grid Computing System

A grid computing system allows a large, complex computing task to efficiently utilize high-performance computing resources by splitting the task into many compute processes to be distributed and executed in parallel at many grid nodes. Under such a paradigm, system fault tolerance is a major issue, as the failure of one grid node results in the failure of the task. Most fault tolerance techniques for a grid com...

Publication date: 2011